Determining the value of a home is becoming increasingly mechanized. With the maturation of statistical modeling techniques and machine learning, a process that was once driven largely by negotiations between individual actors—and the hard-to-measure factor of human taste—can now be abstracted by computers. The purpose of this project is to do just that: to devise a machine learning model that can accurately predict the sale price of a home based on both intrinsic and environmental factors.
Before any discussion of the techniques used to design this model, it is important to emphasize what, and whom, such a thing benefits. We do not believe that machines should learn merely for learning’s sake. A model that can predict home prices, however, stands to benefit many interests. The most obvious beneficiaries, of course, are home buyers and sellers, who can refer to the model as a benchmark, obviating much of the needless back-and-forth that characterizes real estate negotiations.
But there are others who would be better off, too. Neighbors ought to have a sense of their local real estate market given that the value of one home tends to influence that of the next. Local governments, agencies, and other public service providers would also be beneficiaries since they need an accurate measure of the local economy to craft good policy. Finally, foreign investors and businesses that funnel private capital and create jobs in cities are not usually equipped with the local intelligence necessary to make investment decisions, relying inefficiently on the word of locals. An automated valuation model can help to give them that information easily.
The model used in this project is what is known as a hedonic model. Not to be confused with the ancient Greek school of philosophy, a hedonic model estimates the price of a good by breaking it down into the contributions of its constituent characteristics. Here, our hedonic model for home price predictions takes input factors from three primary categories: (1) the physical attributes of each property, (2) nearby public amenities or disamenities, and (3) the clustering of home prices in physical space (also known in the real estate industry as “comparables” or “comps”). A detailed list of the specific factors used in our model follows in the Data section below.
The accuracy of our results can be assessed using a host of different metrics, but the most salient is the R^2 value, measuring how much variation in price is explained by the model. The final model here returned an R^2 value of approximately 0.49, indicating that just under 50% of the variation was explained by the model.
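For reference, the usual definition of R^2 can be written as follows. The symbols are assumptions introduced here for illustration: y_i is the observed sale price of home i, \hat{y}_i is the model's predicted price, and \bar{y} is the mean observed price.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value of 0.49 therefore means that the squared prediction errors sum to roughly 51% of the total squared variation of prices around their mean.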
Every piece of data that was fed into the model falls under one of the three categories stated above (see Introduction). Although the bulk of the data on each home’s internal attributes was already given (e.g. square footage, number of rooms, etc.), data from the second and third categories—that is, nearby amenities or disamenities and spatial patterns—needed to be sourced externally.
Public data sources that were used to do so include the U.S. Census Bureau for demographic information, the State of Colorado’s open data portal for the locations of schools, and Boulder County’s open data portal for local ZIP codes and points of interest for recreation such as trailheads.
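Features such as the average distance to the nearest three schools can be derived from point locations. The following is a minimal sketch in base R, not the exact code used in this project: the helper `avg_knn_dist` and the toy coordinates are hypothetical, and the inputs are assumed to be projected x/y coordinates so that Euclidean distance is meaningful.

```r
# Hypothetical helper: for each row of `from`, the mean Euclidean distance
# to its k nearest rows of `to`. Both are two-column matrices of projected
# x/y coordinates (an assumption of this sketch).
avg_knn_dist <- function(from, to, k) {
  stopifnot(ncol(from) == 2, ncol(to) == 2, k <= nrow(to))
  apply(from, 1, function(p) {
    d <- sqrt((to[, 1] - p[1])^2 + (to[, 2] - p[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}

# Toy coordinates, made up for illustration
homes   <- rbind(c(0, 0), c(10, 0))
schools <- rbind(c(3, 4), c(0, 5), c(6, 8), c(100, 100))

# Average distance from each home to its three nearest schools
schools_nn3 <- avg_knn_dist(homes, schools, k = 3)
schools_nn3
```

For the first toy home, the three nearest schools lie at distances 5, 5, and 10, so its feature value is their mean, about 6.67.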
The summary statistics of the full model are presented below.
| Dependent variable: | |
| price | |
| med_HH_Income | -1.925*** |
| (0.497) | |
| pct.over75K | 1,073,229.000*** |
| (150,718.600) | |
| pct.Information | -2,203,228.000*** |
| (360,495.200) | |
| pct.Finance | 1,594.924 |
| (262,495.100) | |
| pct.Professional | -181,448.900 |
| (152,270.900) | |
| pct.Ed_Health | -446,681.800*** |
| (135,859.400) | |
| nbrBedRoom | 8,919.376* |
| (5,277.324) | |
| nbrFullBaths | -20,167.680*** |
| (6,382.267) | |
| TotalFinishedSF | 159.369*** |
| (8.851) | |
| AcDscrEvaporative Cooler | 52,784.020 |
| (100,480.700) | |
| AcDscrNo AC | 53,806.290 |
| (96,560.950) | |
| AcDscrWhole House | 63,817.470 |
| (96,562.460) | |
| Age | 1,225.806*** |
| (226.321) | |
| schools_nn3 | -32.421*** |
| (4.985) | |
| trailheads_nn5 | -7.301 |
| (5.761) | |
| dist_FR | -14.970*** |
| (2.302) | |
| qualityCodeDscrAVERAGE + | -32,314.160* |
| (18,231.170) | |
| qualityCodeDscrAVERAGE ++ | 31,199.560* |
| (18,721.230) | |
| qualityCodeDscrEXCELLENT | 1,209,125.000*** |
| (42,347.180) | |
| qualityCodeDscrEXCELLENT + | 1,501,140.000*** |
| (94,512.870) | |
| qualityCodeDscrEXCELLENT++ | 2,053,472.000*** |
| (74,958.720) | |
| qualityCodeDscrEXCEPTIONAL 1 | 1,178,518.000*** |
| (94,737.590) | |
| qualityCodeDscrEXCEPTIONAL 2 | 1,989,543.000*** |
| (250,755.900) | |
| qualityCodeDscrFAIR | -74,775.010 |
| (45,893.580) | |
| qualityCodeDscrGOOD | 70,174.540*** |
| (13,001.650) | |
| qualityCodeDscrGOOD + | 109,881.300*** |
| (21,877.160) | |
| qualityCodeDscrGOOD ++ | 205,970.400*** |
| (19,933.570) | |
| qualityCodeDscrLOW | -124,112.600 |
| (97,682.940) | |
| qualityCodeDscrVERY GOOD | 300,628.500*** |
| (21,468.180) | |
| qualityCodeDscrVERY GOOD + | 617,758.800*** |
| (37,791.240) | |
| qualityCodeDscrVERY GOOD ++ | 681,368.300*** |
| (30,880.430) | |
| designCodeDscr2-3 Story | -27,609.520** |
| (11,101.130) | |
| designCodeDscrBi-level | 49,911.570** |
| (25,385.950) | |
| designCodeDscrMULTI STORY- TOWNHOUSE | -132,319.300*** |
| (15,917.080) | |
| designCodeDscrSplit-level | 19,440.710 |
| (16,834.630) | |
| ZipCode80025 | -364,412.500 |
| (283,330.700) | |
| ZipCode80026 | -283,686.300 |
| (216,120.600) | |
| ZipCode80027 | -260,814.200 |
| (216,935.800) | |
| ZipCode80301 | -163,691.500 |
| (217,934.200) | |
| ZipCode80302 | 156,545.000 |
| (219,405.000) | |
| ZipCode80303 | -105,060.900 |
| (218,506.400) | |
| ZipCode80304 | 143,416.500 |
| (219,907.900) | |
| ZipCode80305 | -79,906.970 |
| (219,864.600) | |
| ZipCode80403 | -127,448.300 |
| (227,619.200) | |
| ZipCode80422 | -37,943.930 |
| (264,007.600) | |
| ZipCode80455 | -215,880.600 |
| (232,920.400) | |
| ZipCode80466 | -224,580.500 |
| (217,922.700) | |
| ZipCode80471 | -382,909.300 |
| (490,554.700) | |
| ZipCode80481 | -65,293.540 |
| (225,752.000) | |
| ZipCode80501 | -372,582.700* |
| (216,143.100) | |
| ZipCode80503 | -423,489.700* |
| (216,770.900) | |
| ZipCode80504 | -385,146.600* |
| (216,204.100) | |
| ZipCode80510 | 290,204.600 |
| (235,334.200) | |
| ZipCode80516 | -406,437.100* |
| (216,585.600) | |
| ZipCode80540 | -309,398.300 |
| (222,118.100) | |
| ZipCode80544 | -393,245.800 |
| (305,797.900) | |
| Constant | 861,197.800*** |
| (245,352.800) | |
| Observations | 11,252 |
| R2 | 0.518 |
| Adjusted R2 | 0.516 |
| Residual Std. Error | 428,929.700 (df = 11195) |
| F Statistic | 214.978*** (df = 56; 11195) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
The correlation matrix below depicts how related or unrelated each feature is to the others. For instance, how far a home is from the Front Range appears to be negatively correlated with price, i.e. closeness to the Front Range corresponds to higher price. (Importantly, note that correlation is distinct from causation and that the matrix only serves as a helpful guide for determining relevant factors for the model.)
numericVars <-
select_if(st_drop_geometry(boulder.sf), is.numeric) %>% na.omit()
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
insig = "blank") +
labs(title = "Correlation across numeric variables",
caption = "Figure 1.1")
Scatterplots are another means of showing correlation. The four features shown here are (moving clockwise beginning from the top left): (1) percentage of the population with income greater than $75,000 per year, (2) percentage of the population associated with professional or management services, (3) the average of the distances to the nearest five trailheads, and (4) the average of the distances to the nearest three schools.
The first two demographic features are positively correlated with price, meaning that a greater share of the population with higher income and in professional services corresponds to higher price. Contrariwise, homes that are farther away from trailheads and schools correspond to lower prices.
st_drop_geometry(boulder.sf) %>%
dplyr::select(price, pct.over75K, pct.Professional, schools_nn3, trailheads_nn5) %>%
filter(price <= 4000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables",
caption = "Figure 1.2") +
plotTheme()
Although homes are sold for a wide range of prices across Boulder County, the prices that these homes have fetched tend to cluster in space. The following map demonstrates this phenomenon.
# Sale price by quintile
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = boulder.sf, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder.sf,"price"),
name="Quintile\nBreaks") +
labs(title="Home Sale Prices, Boulder County",
caption = "Figure 1.3") +
mapTheme()
Just as the model’s outcome variable, home sale price, can be depicted on the map, so too can the features that the model will use to predict that outcome. Three of these features are mapped here.
First, the distance between each home and the Front Range is color-coded accordingly below, with the Front Range itself included for reference.
# Distance from the Front Range
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = FrontRange, colour = "#2d6a4f", size = 2) +
geom_sf(data = boulder_homes_observed, aes(colour = q5(dist_FR)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder_homes_observed,"dist_FR"),
name="Quintile\nBreaks") +
labs(title="Home distance from Front Range",
caption = "Figure 1.4") +
mapTheme()
One of the U.S. Census variables, median household income, is mapped by Census tract.
# Median Household Income in Boulder County
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = acsTractsBoulder.2019.sf, aes(fill = med_HH_Income)) +
scale_fill_viridis_b() +
labs(title = "Median Household Income in Boulder County",
subtitle = "by Census Tract",
caption = "Figure 1.5") +
mapTheme()
Finally, the distribution of schools, both public and private, in Boulder County is depicted below.
# Schools in Boulder County
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = boulder_schools, colour = "#2d6a4f") +
labs(title = "Schools in Boulder County",
caption = "Figure 1.6") +
mapTheme()
# Sale Price + zipcodes
ggplot() +
geom_sf(data = boulder_boundary, fill = NA, colour = "black") +
geom_sf(data = boulder_zips, fill = NA, colour = "#55286F") +
geom_sf(data = boulder_homes_observed, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder_homes_observed,"price"),
name="Quintile\nBreaks") +
labs(title="Home Sale Prices + Zip Code Areas",
caption = "Figure 1.7") +
mapTheme()
The type of statistical model used here is called a linear regression, fitted by ordinary least squares (OLS). This type of model expresses the outcome as a weighted sum of the input features, choosing the weights (coefficients) that minimize the sum of squared differences between the observed and predicted values. In this case, the model was derived from the various data features described above, factoring in each home’s internal and environmental characteristics to predict price.
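To make the least-squares idea concrete, the following sketch fits a linear regression on simulated data and checks that `lm()` returns the same coefficients as solving the normal equations directly. Everything here is an assumption for illustration: the data frame `toy`, its columns `sqft_k` (finished area in thousands of square feet) and `dist_km`, and the simulated coefficients are not taken from the project's data.

```r
set.seed(825)
n <- 200

# Simulated data (hypothetical): area in thousands of square feet,
# distance in kilometers, and a price built from known coefficients plus noise
toy <- data.frame(
  sqft_k  = runif(n, 0.8, 4),
  dist_km = runif(n, 0, 10)
)
toy$price <- 100000 + 150000 * toy$sqft_k - 10000 * toy$dist_km +
  rnorm(n, sd = 20000)

# Fit by ordinary least squares
fit <- lm(price ~ sqft_k + dist_km, data = toy)

# The same coefficients from the normal equations: (X'X) b = X'y
X <- model.matrix(fit)
beta_hat <- solve(crossprod(X), crossprod(X, toy$price))

# Both approaches agree up to numerical error
all.equal(unname(coef(fit)), as.vector(beta_hat))
```

Each fitted coefficient is read the same way as in the regression tables in this report: the expected change in price for a one-unit change in that feature, holding the others fixed.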
The strength of a prediction model can be distilled down to two interrelated qualities: accuracy and generalizability.
Accuracy refers to the ability of a model to produce predicted values that are as close as possible to the actual observed values. To test the accuracy of the model here, the original data was divided into two groups. One group, the training set, represented 75% of the original data and was used to create, or “train,” the regression model. The second group, the test set, represented the remaining 25% of the data and was used to measure how close, or not, the predictions came to the corresponding observed values. Accuracy can be measured by the metrics of “mean absolute error” (MAE) and “mean absolute percent error” (MAPE).
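The split-and-score procedure can be sketched as follows on simulated data. This is an illustration under stated assumptions rather than the project's code: the data frame `toy` is made up, and `sample()` stands in for whatever partitioning function was actually used.

```r
set.seed(825)
n <- 400

# Simulated homes (hypothetical data)
toy <- data.frame(sqft = runif(n, 800, 4000))
toy$price <- 100000 + 150 * toy$sqft + rnorm(n, sd = 30000)

# 75% of rows go to the training set, the remaining 25% to the test set
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Train on the training set only
fit <- lm(price ~ sqft, data = toy_train)

# Score on the held-out test set
pred      <- predict(fit, newdata = toy_test)
abs_error <- abs(toy_test$price - pred)
mae  <- mean(abs_error)                    # mean absolute error, in dollars
mape <- mean(abs_error / toy_test$price)   # mean absolute percent error, as a fraction
```

MAE is expressed in dollars, while MAPE scales each error by the home's observed price, which makes errors comparable between inexpensive and expensive homes.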
Generalizability refers to the ability of a model to make predictions based on new, unseen data. The generalizability of the model here was tested using a method known as k-fold cross-validation. The dataset was divided into 100 subsets, or “folds.” Each fold was held out in turn as a test set while the model was trained on the remaining 99 folds. The errors across all 100 folds were then summarized to capture how well the model performs on data that it has yet to encounter.
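The mechanics of k-fold cross-validation can be sketched in base R as follows. This is a simplified illustration on simulated data, using k = 10 rather than 100 to keep it short; the data frame `toy` and the variable names are assumptions, and the project itself relies on the caret package for this step.

```r
set.seed(825)
n <- 500

# Simulated homes (hypothetical data)
toy <- data.frame(sqft = runif(n, 800, 4000))
toy$price <- 100000 + 150 * toy$sqft + rnorm(n, sd = 30000)

# Randomly assign every row to one of k folds of equal size
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, then score on the held-out fold
fold_mae <- sapply(seq_len(k), function(i) {
  held_out <- toy[fold_id == i, ]
  fit <- lm(price ~ sqft, data = toy[fold_id != i, ])
  mean(abs(held_out$price - predict(fit, newdata = held_out)))
})

mean(fold_mae)  # average error across folds
sd(fold_mae)    # spread across folds: smaller means more consistent performance
```

A tight distribution of per-fold errors suggests that the model performs similarly regardless of which homes it was trained on.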
The following summary table presents the results of the linear regression on the training data set.
| Dependent variable: | |
| price | |
| med_HH_Income | -2.244*** |
| (0.608) | |
| pct.over75K | 1,166,966.000*** |
| (184,708.600) | |
| pct.Information | -2,544,788.000*** |
| (437,167.000) | |
| pct.Finance | 15,107.560 |
| (318,801.800) | |
| pct.Professional | -177,234.300 |
| (185,614.800) | |
| pct.Ed_Health | -447,466.400*** |
| (164,602.700) | |
| nbrBedRoom | 10,584.790 |
| (6,465.900) | |
| nbrFullBaths | -26,951.370*** |
| (7,701.113) | |
| TotalFinishedSF | 162.138*** |
| (10.635) | |
| AcDscrEvaporative Cooler | 61,462.930 |
| (108,280.800) | |
| AcDscrNo AC | 59,295.340 |
| (103,736.400) | |
| AcDscrWhole House | 60,411.740 |
| (103,734.600) | |
| Age | 936.334*** |
| (274.080) | |
| schools_nn3 | -29.274*** |
| (5.997) | |
| trailheads_nn5 | -10.707 |
| (6.929) | |
| dist_FR | -13.671*** |
| (2.785) | |
| qualityCodeDscrAVERAGE + | -36,815.380* |
| (21,834.160) | |
| qualityCodeDscrAVERAGE ++ | 20,464.010 |
| (22,297.830) | |
| qualityCodeDscrEXCELLENT | 1,210,563.000*** |
| (50,626.680) | |
| qualityCodeDscrEXCELLENT + | 1,427,054.000*** |
| (109,284.500) | |
| qualityCodeDscrEXCELLENT++ | 1,861,940.000*** |
| (85,542.050) | |
| qualityCodeDscrEXCEPTIONAL 1 | 1,149,468.000*** |
| (107,483.000) | |
| qualityCodeDscrEXCEPTIONAL 2 | 1,990,171.000*** |
| (269,800.200) | |
| qualityCodeDscrFAIR | -98,568.580* |
| (53,414.200) | |
| qualityCodeDscrGOOD | 60,064.260*** |
| (15,832.510) | |
| qualityCodeDscrGOOD + | 103,969.300*** |
| (26,379.780) | |
| qualityCodeDscrGOOD ++ | 198,980.300*** |
| (24,016.660) | |
| qualityCodeDscrLOW | -138,747.500 |
| (110,655.200) | |
| qualityCodeDscrVERY GOOD | 291,709.100*** |
| (25,939.220) | |
| qualityCodeDscrVERY GOOD + | 608,531.300*** |
| (44,908.030) | |
| qualityCodeDscrVERY GOOD ++ | 672,810.800*** |
| (36,828.180) | |
| designCodeDscr2-3 Story | -24,415.950* |
| (13,495.780) | |
| designCodeDscrBi-level | 48,895.660 |
| (29,870.130) | |
| designCodeDscrMULTI STORY- TOWNHOUSE | -127,779.600*** |
| (19,323.430) | |
| designCodeDscrSplit-level | 16,070.850 |
| (20,085.200) | |
| ZipCode80025 | -312,810.800 |
| (305,551.100) | |
| ZipCode80026 | -279,657.800 |
| (232,256.200) | |
| ZipCode80027 | -244,639.600 |
| (233,343.600) | |
| ZipCode80301 | -125,181.200 |
| (234,656.900) | |
| ZipCode80302 | 191,607.800 |
| (236,717.100) | |
| ZipCode80303 | -65,854.520 |
| (235,418.600) | |
| ZipCode80304 | 160,602.200 |
| (237,354.500) | |
| ZipCode80305 | -53,060.890 |
| (237,285.000) | |
| ZipCode80403 | -137,589.300 |
| (247,045.000) | |
| ZipCode80422 | -42,427.790 |
| (290,110.600) | |
| ZipCode80455 | -199,819.200 |
| (251,932.100) | |
| ZipCode80466 | -231,143.700 |
| (234,636.400) | |
| ZipCode80471 | -363,424.400 |
| (527,629.800) | |
| ZipCode80481 | -61,651.600 |
| (244,252.300) | |
| ZipCode80501 | -365,969.700 |
| (232,286.000) | |
| ZipCode80503 | -420,085.000* |
| (233,137.300) | |
| ZipCode80504 | -376,917.600 |
| (232,378.900) | |
| ZipCode80510 | 258,247.400 |
| (257,187.400) | |
| ZipCode80516 | -404,424.200* |
| (232,919.100) | |
| ZipCode80540 | -303,798.100 |
| (240,025.400) | |
| ZipCode80544 | -390,646.000 |
| (328,581.000) | |
| Constant | 869,273.200*** |
| (266,092.900) | |
| Observations | 8,793 |
| R2 | 0.492 |
| Adjusted R2 | 0.488 |
| Residual Std. Error | 459,982.400 (df = 8736) |
| F Statistic | 150.929*** (df = 56; 8736) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
A summary of the mean absolute error (MAE) and mean absolute percent error (MAPE) for the price prediction on the test data set is shown below.
Graphic representations of the results of the test set prediction are shown below.
# histogram of absolute errors
ggplot(boulder.test, aes(x = price.abserror)) +
geom_histogram(binwidth=10000, fill = "green", colour = "white") +
scale_x_continuous(limits = c(0, 1000000)) +
labs(title = "Distribution of prediction errors for single test",
x = "Sale Price Absolute Error", y = "Count") +
plotTheme()
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
reg.cv <-
train(price ~ ., data = st_drop_geometry(boulder.sf),
method = "lm", trControl = fitControl, na.action = na.pass)
K-fold cross-validation with 100 folds is used to explore the generalizability of this model. A histogram of the mean absolute error (MAE) across the 100 folds is shown below.
# histogram of cross validation MAE
# Per-fold MAE from caret's resampling results
mae <- data.frame(mae = reg.cv$resample$MAE)
ggplot(mae, aes(x = mae)) +
geom_histogram(binwidth=10000, fill = "orange", colour = "white") +
scale_x_continuous(breaks = seq(0, 500000, 100000),
labels = scales::comma,
limits = c(0, 500000)) +
labs(title = "Distribution of MAE",
subtitle = "k-fold cross validation; k = 100",
x = "Mean Absolute Error", y = "Count") +
plotTheme()
The prices predicted for the test set are plotted against the actual sale prices for the test set in the figure below.
ggplot(boulder.test) +
geom_point(aes(price, price.predict)) +
# 45-degree reference line: predicted price equals observed price
geom_abline(intercept = 0, slope = 1, colour = "orange") +
geom_smooth(method = "lm", aes(price, price.predict), se = FALSE, colour = "green") +
labs(title = "Predicted sale price as a function of observed price",
subtitle = "Orange line represents a perfect prediction; Green line represents prediction",
x = "Observed Sale Price", y = "Predicted Sale Price") +
plotTheme()
Residual absolute errors for the test set are mapped onto Boulder County below.
ggplot() +
geom_sf(data = boulder_boundary, fill = "grey") +
geom_sf(data = boulder.test, aes(colour = q5(price.abserror)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder.test,"price.abserror"),
name="Quintile\nBreaks") +
labs(title="Test set absolute price errors",
caption = "Figure X.X") +
mapTheme()
Because of the geographically clustered nature of real estate, errors in price predictions tend also to cluster in space. One way to detect this is the “spatial lag” of the errors: for each home, the average prediction error of its five nearest neighbors. The following plot depicts each home’s error against the spatial lag of errors in the model.
coords.test <- st_coordinates(boulder.test)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
boulder.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.error)) %>%
ggplot(aes(lagPriceError, price.error)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
labs(title = "Error as a function of the spatial lag of price errors") +
plotTheme()
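To show what a k-nearest-neighbor spatial lag computes, the following base R sketch averages the values of each point's nearest neighbors, which is what row-standardized weights amount to. The helper `knn_spatial_lag`, the toy coordinates, and the toy errors are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: for each point, the mean of `values` over its
# k nearest neighbors (the point itself is excluded)
knn_spatial_lag <- function(coords, values, k) {
  d <- as.matrix(dist(coords))
  diag(d) <- Inf  # a point is not its own neighbor
  sapply(seq_len(nrow(d)), function(i) {
    nbrs <- order(d[i, ])[seq_len(k)]
    mean(values[nbrs])
  })
}

# Two toy clusters: over-predicted homes near the origin,
# under-predicted homes near (10, 10)
coords <- rbind(c(0, 0), c(1, 0), c(0, 1), c(10, 10), c(11, 10), c(10, 11))
errors <- c(50, 60, 40, -30, -20, -40)

lag_errors <- knn_spatial_lag(coords, errors, k = 2)
lag_errors
# → 50 45 55 -30 -35 -25

# A positive correlation between errors and their spatial lag indicates clustering
cor(errors, lag_errors)
```

In this toy case each home's error has the same sign as the average error of its neighbors, which is the pattern that an upward-sloping line in the spatial lag plot represents.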
The clustering of values in space (the technical term is “spatial autocorrelation”) can also be measured with the statistic known as Moran’s I. A Moran’s I near positive 1 indicates clustering, a value near 0 indicates a random spatial distribution, and a negative value indicates dispersion. In the figure below, the observed Moran’s I of the test-set price errors, depicted in orange, is contrasted with the values from 999 random permutations, showing that the model’s price errors in Boulder County do indeed cluster in space.
moranTest <- moran.mc(boulder.test$price.error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
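For reference, the standard definition of Moran's I is sketched below. The symbols are assumptions introduced here for illustration: x_i is the prediction error of home i, \bar{x} is the mean error, n is the number of homes, and w_{ij} is the spatial weight between homes i and j (nonzero only when j is among the five nearest neighbors of i).

```latex
I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, (x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

With row-standardized weights, each row of w sums to 1, so the leading fraction equals 1. The permutation approach recomputes I after randomly shuffling the errors across locations, which yields the reference distribution against which the observed value is compared.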
allPredictions <- boulder_homes %>%
mutate(predictions = predict(reg1, boulder_homes)) %>%
dplyr::select(predictions)
ggplot() +
geom_sf(data = boulder_boundary, fill = "grey") +
geom_sf(data = allPredictions, aes(colour = q5(predictions)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(allPredictions,"predictions"),
name="Quintile\nBreaks") +
labs(title="Predictions for all homes in the dataset, Boulder County",
caption = "Figure X.X") +
mapTheme()
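One ZIP code stands out: in the map of mean test-set MAPE by ZIP code below, 80540 has a MAPE of roughly 1.73, several times higher than that of any other ZIP code.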
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(mean.MAPE = mean(price.ape, na.rm = T)) %>%
ungroup() %>%
left_join(boulder_zips) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = mean.MAPE)) +
geom_sf(data = boulder.test, colour = "black", size = .5) +
scale_fill_gradient(low = palette5[1], high = palette5[5],
name = "MAPE") +
labs(title = "Mean test set MAPE by Zip Code") +
mapTheme()
The table and scatterplot below compare the test-set MAPE of each ZIP code with its mean sale price.
testError_by_zips <-
left_join(
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(meanPrice = mean(price, na.rm = T)),
st_drop_geometry(boulder.test) %>%
group_by(ZipCode) %>%
summarize(MAPE = mean(price.ape)))
testError_by_zips %>%
kable() %>% kable_styling()
| ZipCode | meanPrice | MAPE |
|---|---|---|
| 80026 | 635384.3 | 0.1809021 |
| 80027 | 722815.3 | 0.1935969 |
| 80301 | 828126.3 | 0.1925919 |
| 80302 | 1239133.0 | 0.3064021 |
| 80303 | 854227.0 | 0.2154622 |
| 80304 | 1432077.5 | 0.2781936 |
| 80305 | 917100.0 | 0.2027842 |
| 80403 | 425888.9 | 0.1832423 |
| 80422 | 535000.0 | 0.1743771 |
| 80455 | 618750.0 | 0.1385517 |
| 80466 | 550730.6 | 0.2435732 |
| 80481 | 408071.4 | 0.3302264 |
| 80501 | 434675.0 | 0.1999699 |
| 80503 | 696281.1 | 0.1877800 |
| 80504 | 510797.5 | 0.1793604 |
| 80510 | 268088.9 | 0.3440662 |
| 80516 | 626238.3 | 0.2524920 |
| 80540 | 484995.0 | 1.7348011 |
ggplot(testError_by_zips) +
geom_point(aes(meanPrice, MAPE)) +
geom_smooth(method = "lm", aes(meanPrice, MAPE), se = FALSE, colour = "green") +
labs(title = "MAPE by Zip Code as a function of mean price by Zip Code",
x = "Mean Home Price", y = "MAPE") +
plotTheme()
The model’s generalizability can be evaluated in relation to factors on the map. Below, two Census factors are depicted in Boulder County: race as a function of white vs. non-white and income level.
Because Boulder County is relatively homogeneous racially, the model is unlikely to perform very differently across racial contexts, which suggests that it is fairly generalizable on that score. Variations in income, in contrast, might present an obstacle to generalizability.
(The greatest challenge to the model’s generalizability comes in the urban-rural distinction, however, as discussed later in this project.)
boulder_tracts19 <-
get_acs(geography = "tract", year = 2019,
variables = c("B01001_001E","B01001A_001E","B06011_001"),
geometry = TRUE, state = "CO", county = "Boulder", output = "wide") %>%
st_transform('ESRI:102254') %>%
rename(TotalPop = B01001_001E,
NumberWhites = B01001A_001E,
Median_Income = B06011_001E) %>%
mutate(percentWhite = NumberWhites / TotalPop,
raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(Median_Income > 32322, "High Income", "Low Income"))
grid.arrange(ncol = 2,
ggplot() + geom_sf(data = na.omit(boulder_tracts19),
aes(fill = raceContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
labs(title = "Race Context") +
mapTheme() + theme(legend.position="bottom"),
ggplot() + geom_sf(data = na.omit(boulder_tracts19),
aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"),
name="Income Context") +
labs(title = "Income Context") +
mapTheme() +
theme(legend.position="bottom"))
The above results suggest that the model was effective in some respects and defective in others. In sum, the model was able to explain just under 50% of the variation in prices. The contribution of individual features varied widely. The two that clearly outperformed the rest were distance to schools and distance to the Front Range. Both were highly statistically significant and contributed substantially to the model, particularly when considered in the aggregate.
Conversely, the ZIP codes of each home were, surprisingly, not especially significant on their own. Despite one’s common sense intuition about the importance of a house’s ZIP code, the results suggest that on an individual basis, ZIP codes were not strongly determinative of price. That said, it is notable that when ZIP codes were removed during the modeling process, the overall accuracy declined markedly. That would indicate that ZIP codes, while relatively insignificant on a per-property basis, are integral to the model as a whole.
Both the strength and weakness of the model can be attributed to the geography of Boulder County. The model excelled in urban areas, clustered around Boulder and other large municipalities contained within the greater county. The reason for this is likely that environmental features such as nearness to schools and recreational amenities are of greater importance for homes in the denser parts of the county, whose residents actively consider such elements when purchasing a home.
On the other hand, the model saw a sizable drop in accuracy in the rural parts of the county. Certain properties located farther into the mountains were clearly sui generis, with prices that diverged significantly from the rest. The model struggled to account for these properties, likely because many of the features important in urban areas are simply inapposite in the rural context. Homebuyers who are seeking a mountain getaway house, for example, are less interested in their house’s proximity to schools.
Although it represents a fine starting point, the model in its present form would likely not be ready for deployment by Zillow. Beyond even the need to improve base metrics such as average error, the more fundamental problems identified in the Discussion section would need to be remedied before Zillow’s vast user base could rely on the model.
Thankfully, several areas for improvement can already be identified. To start, more features should be added. Data which were not included in the model but which would doubtless prove useful include crime data, school districts ranked by desirability, and other features that better account for the clustering effects of home prices, such as neighborhoods. Although these data may not be available in pre-packaged form from the open data sources used here, such obstacles can likely be overcome with careful feature engineering.
As for the urban–rural issue articulated in the previous section, one possible solution is to revise the model to first predict price per square foot rather than total price. Price per square foot better measures the effect of location on property value, as it is more comparable across properties irrespective of the physical buildings. When coupled with more variables that account for the spatial clustering of price, a price-per-square-foot metric would likely distinguish between urban and rural properties better than the current model.
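The price-per-square-foot idea could be prototyped along the following lines. This is only a sketch on simulated data and not a tested revision of the model: the data frame `toy`, the `rural` indicator, and the simulated price levels are all hypothetical.

```r
set.seed(825)
n <- 300

# Simulated homes (hypothetical): finished area and a rural indicator
toy <- data.frame(
  sqft  = runif(n, 800, 4000),
  rural = rbinom(n, 1, 0.3)
)

# Simulated price per square foot, assumed lower in rural areas
toy$ppsf  <- 350 - 120 * toy$rural + rnorm(n, sd = 25)
toy$price <- toy$ppsf * toy$sqft

# Step 1: model price per square foot instead of total price
fit_ppsf <- lm(ppsf ~ rural, data = toy)

# Step 2: convert the prediction back to a total price
toy$price_pred <- predict(fit_ppsf, newdata = toy) * toy$sqft

# Errors can then be scored on total price, as elsewhere in the report
mean(abs(toy$price - toy$price_pred) / toy$price)
```

Modeling the per-square-foot rate separates the effect of location from the size of the building, and multiplying back by finished area keeps the final output comparable with the current model's predictions.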